Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Abstract

Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
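
As a rough illustration of the GRPO stage mentioned above, the sketch below shows the group-relative advantage computation that gives GRPO its name: each sampled response is scored against the mean and standard deviation of the rewards in its own group, rather than against a learned value function. This is a minimal sketch of the general technique, not the authors' implementation. The function name, the `eps` stabilizer, the use of the population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per response sampled for the
    same prompt. `eps` (an assumed stabilizer) avoids division by
    zero when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one video question
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. When all responses in a group earn the same reward, every advantage is zero and that group contributes no learning signal.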
